Proteomics-Based Comparative Mapping of the Secretomes of Human Brown and White Adipocytes Reveals EPDR1 as a Novel Batokine¶

Adipokines secreted from white adipose tissue play a role in metabolic crosstalk and homeostasis, whereas the brown adipose secretome is less explored. We performed high-sensitivity mass-spec- trometry-based proteomics on the cell media of human adipocytes derived from the supraclavicular brown ad- ipose and from the subcutaneous white adipose de- pots of adult humans. We identified 471 potentially secreted proteins covering interesting categories such as hormones, growth factors, extracellular matrix proteins, and proteins of the complement system, which were differentially regulated between brown and white adipocytes. A total of 101 proteins were exclusively quantified in brown adipocytes, and among these was ependymin-related protein 1 (EPDR1). EPDR1 was detected in human plasma, and functional studies suggested a role for EPDR1 in thermogenic determination during adipogenesis. In conclusion, we report substantial differences between the secretomes of brown and white human adipocytes and identify novel candidate batokines that can be important regu- lators of human metabolism. REF: https://www.cell.com/cell-metabolism/fulltext/S1550-4131(19)30556-X

|graphical\_abstract.png|

[1]:

import os
import re
import pandas as pd
import numpy as np

import ckg.ckg_utils as ckg_utils
from ckg.graphdb_builder import builder_utils, mapping
from ckg.graphdb_builder.experiments.parsers import proteomicsParser
from ckg.graphdb_connector import connector

from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from ckg.report_manager import knowledge

from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)

C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.10, the latest is 0.3.11.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package outdated is out of date. Your version is 0.2.0, the latest is 0.2.1.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs

WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

[2]:

analysis_dir = '../../../../data/tmp/Deshmukh2019'
ckg_utils.checkDirectory(analysis_dir)

pxd_id = 'PXD008541'
file_name='SearchEngineResults_secretome.zip.rar'

Download Data from PRIDE¶

We can use functionality in graphdb_builder to directly download data files from EBI’s PRIDE database (https://www.ebi.ac.uk/pride/). For that you you just need to specify the PRIDE identifier for the project (PXD…) and the name of the file to download. In this case, the project identifier is PXD008541 and the file we will use is SearchEngineResults_secretome.zip.rar, a RAR compressed file.

[3]:

builder_utils.download_PRIDE_data(pxd_id=pxd_id,
                                  file_name=file_name,
                                  to=analysis_dir)

'publicationDate'

[3]:

{'status': 'INTERNAL_SERVER_ERROR',
 'code': 500,
 'message': 'Internal Server Error: Could not open JPA EntityManager for transaction; nested exception is javax.persistence.QueryTimeoutException: Could not open connection',
 'developerMessage': 'Please report to pride-support@ebi.ac.uk',
 'moreInfoUrl': None,
 'throwable': None}

Read Data In¶

Decompress File¶

[4]:

builder_utils.unrar(filepath=os.path.join(analysis_dir, file_name), to=analysis_dir)

Error: ../../../../data/tmp/Deshmukh2019\SearchEngineResults_secretome.zip.rar. Could not unrar file Unrar not installed? (rarfile.UNRAR_TOOL='unrar')

The list of files within the compressed folder can be listed using the listDirectoryFiles functionality in gaphdb_builder.

[5]:

builder_utils.listDirectoryFiles(analysis_dir)

[5]:

['anova_results.tsv',
 'BAT_woNE_vs_WAT_woNE_volcano_plot.svg',
 'missing_values_per_sample.svg',
 'original_data_boxplot.svg',
 'original_data_distplot.svg',
 'original_data_parallel_plot_upregulated_proteins.svg',
 'pca.svg',
 'processed_data_boxplot.svg',
 'processed_data_distplot.svg',
 'processed_data_matrix.tsv',
 'processed_data_parallel_plot_upregulated_proteins.svg',
 'SearchEngineResults_secretome.zip.rar',
 'secreted_processed_data.tsv',
 'secretome.tsv']

We use the proteinGroups file that contains the proteomics data processed using MaxQuant software.

[6]:

proteinGroups_file = os.path.join(os.path.join(analysis_dir, "SearchEngineResults_secretome"), 'proteinGroups.txt')

Parse Contents¶

CKG has parsers for MaxQuant and Spectronaut output files. The default configuration needed to parse these files needs to be updated with the name of the columns containing the protein quantifications for each sample. Also, the default configuration can be adapted to the experiment by selected specific filters or removing non-used columns. For example, in this study the output file did not have columns: Score, Q-value, so we removed them from the configuration and the column ‘Potential contaminant’ was renamed to ‘Contaminant’ so we changed the name in the filters.

[7]:

#d = pd.read_csv(proteinGroups_file, sep='\t')
#d.columns.tolist()

[8]:

columns = ['LFQ intensity BAT_NE1',
           'LFQ intensity BAT_NE2',
           'LFQ intensity BAT_NE3',
           'LFQ intensity BAT_NE4',
           'LFQ intensity BAT_NE5',
           'LFQ intensity BAT_woNE1',
           'LFQ intensity BAT_woNE2',
           'LFQ intensity BAT_woNE3',
           'LFQ intensity BAT_woNE4',
           'LFQ intensity BAT_woNE5',
           'LFQ intensity WAT_NE1',
           'LFQ intensity WAT_NE2',
           'LFQ intensity WAT_NE3',
           'LFQ intensity WAT_NE4',
           'LFQ intensity WAT_NE5',
           'LFQ intensity WAT_woNE1',
           'LFQ intensity WAT_woNE2',
           'LFQ intensity WAT_woNE3',
           'LFQ intensity WAT_woNE4',
           'LFQ intensity WAT_woNE5',
           'Contaminant']

[9]:

configuration = proteomicsParser.update_configuration(data_type='proteins',
                                                      processing_tool='maxquant',
                                                      value_col='LFQ intensity',
                                                      columns=columns,
                                                      regex=r'(\w+_\w+\d+)',
                                                      drop_cols=['Score', 'Q-value', 'Potential contaminant'],
                                                      filters=['Reverse', 'Only identified by site', 'Contaminant'])

[10]:

configuration

[10]:

{'columns': ['Majority protein IDs',
  'Razor + unique peptides',
  'id',
  'LFQ intensity AS\\d+_?-?\\d*',
  'Intensity AS\\d+_?-?\\d*',
  'Reverse',
  'Only identified by site',
  'is_razor',
  'LFQ intensity BAT_NE1',
  'LFQ intensity BAT_NE2',
  'LFQ intensity BAT_NE3',
  'LFQ intensity BAT_NE4',
  'LFQ intensity BAT_NE5',
  'LFQ intensity BAT_woNE1',
  'LFQ intensity BAT_woNE2',
  'LFQ intensity BAT_woNE3',
  'LFQ intensity BAT_woNE4',
  'LFQ intensity BAT_woNE5',
  'LFQ intensity WAT_NE1',
  'LFQ intensity WAT_NE2',
  'LFQ intensity WAT_NE3',
  'LFQ intensity WAT_NE4',
  'LFQ intensity WAT_NE5',
  'LFQ intensity WAT_woNE1',
  'LFQ intensity WAT_woNE2',
  'LFQ intensity WAT_woNE3',
  'LFQ intensity WAT_woNE4',
  'LFQ intensity WAT_woNE5',
  'Contaminant'],
 'generated_columns': ['is_razor'],
 'filters': ['Reverse', 'Only identified by site', 'Contaminant'],
 'proteinCol': 'Majority protein IDs',
 'contaminant_tag': 'CON__',
 'valueCol': 'LFQ intensity',
 'groupCol': 'id',
 'indexCol': 'Majority protein IDs',
 'attributes': {'cols': ['id', 'is_razor'], 'regex': ['Intensity']},
 'log': 'log2',
 'file': 'proteinGroups.txt',
 'regex': '(\\w+_\\w+\\d+)'}

When we parse the data, we obtain a matrix in an edge list following CKG’s graph format: sample, protein, realtionship_type, value, protein_group_id, is_razor

[11]:

data = proteomicsParser.parser_from_file(proteinGroups_file, configuration=configuration, data_type='proteins')[('proteins', 'w')]

[12]:

data.head()

[12]:

	START_ID	END_ID	TYPE	value	id	is_razor
0	WAT_woNE5	A0AVL1	HAS_QUANTIFIED_PROTEIN	20.109423	0	False
1	BAT_NE5	A1A441	HAS_QUANTIFIED_PROTEIN	22.986122	1	False
2	BAT_woNE3	A1A441	HAS_QUANTIFIED_PROTEIN	22.753054	1	False
3	BAT_woNE5	A1A441	HAS_QUANTIFIED_PROTEIN	22.475955	1	False
4	BAT_NE3	A1A441	HAS_QUANTIFIED_PROTEIN	23.317552	1	False

[13]:

data.columns = ['sample', 'identifier', 'relationship', 'LFQ intensity', 'id', 'is_razor']

[14]:

data.head()

[14]:

	sample	identifier	relationship	LFQ intensity	id	is_razor
0	WAT_woNE5	A0AVL1	HAS_QUANTIFIED_PROTEIN	20.109423	0	False
1	BAT_NE5	A1A441	HAS_QUANTIFIED_PROTEIN	22.986122	1	False
2	BAT_woNE3	A1A441	HAS_QUANTIFIED_PROTEIN	22.753054	1	False
3	BAT_woNE5	A1A441	HAS_QUANTIFIED_PROTEIN	22.475955	1	False
4	BAT_NE3	A1A441	HAS_QUANTIFIED_PROTEIN	23.317552	1	False

[15]:

data.shape

[15]:

(57470, 6)

[16]:

data = data[data.is_razor]

[17]:

data.shape

[17]:

(17489, 6)

We can use the sample names to extract the group information: BAT_NE, WAT_NE, BAT_woNE, WAT_woNE

With this last column, we obtain the original dataframe used as starting point in CKG’ analysis pipelines.

[18]:

data['group'] = data['sample'].apply(lambda x: re.sub('\d', '', x))

[19]:

data.head()

[19]:

	sample	identifier	relationship	LFQ intensity	id	is_razor	group
7	BAT_NE4	A1L4H1	HAS_QUANTIFIED_PROTEIN	21.686542	2	True	BAT_NE
134	WAT_NE4	A6NCN2	HAS_QUANTIFIED_PROTEIN	27.843440	10	True	WAT_NE
135	BAT_woNE2	A6NDG6	HAS_QUANTIFIED_PROTEIN	23.446827	11	True	BAT_woNE
136	BAT_NE5	A6NDG6	HAS_QUANTIFIED_PROTEIN	23.047168	11	True	BAT_NE
137	BAT_woNE3	A6NDG6	HAS_QUANTIFIED_PROTEIN	23.307164	11	True	BAT_woNE

[20]:

original = data[['group', 'sample', 'identifier', 'LFQ intensity']]

[21]:

df = analytics.transform_proteomics_edgelist(original,
                                           index_cols=['group', 'sample'],
                                           drop_cols=['sample'],
                                           group='group',
                                           identifier='identifier',
                                           extra_identifier=None,
                                           value_col='LFQ intensity')

[22]:

df.shape

[22]:

(20, 1868)

[23]:

df.describe()

[23]:

identifier	A1L4H1	A6NCN2	A6NDG6	A8K2U0	B3EWG6	B3KW70	B4DUF5	B4DUW4	D6RD97	D6REB5	...	Q9Y680	Q9Y696	Q9Y6B6	Q9Y6C2	Q9Y6C9	Q9Y6D6	Q9Y6E0	Q9Y6E2	Q9Y6G9	Q9Y6I3
count	1.000000	1.00000	8.000000	5.000000	1.000000	20.000000	1.000000	1.00000	1.000000	1.00000	...	1.000000	14.000000	1.000000	1.000000	1.000000	1.000000	2.000000	1.000000	2.000000	7.000000
mean	21.686542	27.84344	23.357560	21.966594	22.579298	28.716408	20.063026	24.12849	22.271874	20.50174	...	19.579739	25.661670	20.637148	22.149265	20.798991	21.932686	20.572377	21.000033	23.177596	22.972208
std	NaN	NaN	0.326612	0.497770	NaN	0.928132	NaN	NaN	NaN	NaN	...	NaN	0.862677	NaN	NaN	NaN	NaN	0.105813	NaN	0.040308	0.617836
min	21.686542	27.84344	22.817377	21.346029	22.579298	27.265350	20.063026	24.12849	22.271874	20.50174	...	19.579739	24.221919	20.637148	22.149265	20.798991	21.932686	20.497556	21.000033	23.149094	22.132167
25%	21.686542	27.84344	23.210751	21.830435	22.579298	28.257911	20.063026	24.12849	22.271874	20.50174	...	19.579739	25.036140	20.637148	22.149265	20.798991	21.932686	20.534966	21.000033	23.163345	22.548973
50%	21.686542	27.84344	23.376995	21.896706	22.579298	28.404056	20.063026	24.12849	22.271874	20.50174	...	19.579739	25.533329	20.637148	22.149265	20.798991	21.932686	20.572377	21.000033	23.177596	23.089712
75%	21.686542	27.84344	23.538843	22.032815	22.579298	29.141976	20.063026	24.12849	22.271874	20.50174	...	19.579739	26.233573	20.637148	22.149265	20.798991	21.932686	20.609788	21.000033	23.191847	23.282008
max	21.686542	27.84344	23.867311	22.726982	22.579298	30.812377	20.063026	24.12849	22.271874	20.50174	...	19.579739	26.981961	20.637148	22.149265	20.798991	21.932686	20.647198	21.000033	23.206099	23.921616

8 rows × 1866 columns

[24]:

aux = df.set_index(['sample']).T.isna().sum().reset_index().sort_values(by='sample')
plot = viz.get_barplot(aux, identifier='missing', args={'x':0,
                                                 'y':'sample',
                                                 'title':"Missing values per sample",
                                                 "orientation":'h',
                                                "x_title":'# missing values', 'y_title':'samples',
                                                 'height':900, 'width':400})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='missing_values_per_sample', plot_format='svg', directory=analysis_dir)

[25]:

aux = df.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_boxplot_grid(aux, identifier='boxplot', args={"Title":"Boxplot", 'x':'sample', 'y':'value', 'color':'group', 'axis':'cols', 'height':200})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_boxplot', plot_format='svg', directory=analysis_dir)

IN

[26]:

aux = df.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_histogram(aux, identifier='histogram', args={'x':'value', 'color':'group', 'facet_row':'sample', 'title':'Facet Grid Plot'})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_distplot', plot_format='svg', directory=analysis_dir)

Data Preparation¶

In order to prepare the data we follow the steps:

Filtering based on missing values
Imputation of missing values using a mixed model estrategy: KNN and MinProb

These steps will generate the processed dataframe, a complete matrix that can be used in the exploratory and statistical analysis.

[27]:

processed_data = analytics.get_proteomics_measurements_ready(original,
                                                             index_cols=['group', 'sample'],
                                                             drop_cols=['sample'],
                                                             group='group',
                                                             identifier='identifier',
                                                             extra_identifier=None,
                                                             imputation=True,
                                                             method='mixed',
                                                             knn_cutoff=0.4,
                                                             missing_method='at_least_x',
                                                             missing_per_group=True,
                                                             min_valid=3,
                                                             value_col='LFQ intensity',
                                                             shift=1.8,
                                                             nstd=0.3)

[28]:

processed_data.head()

[28]:

identifier	group	sample	A6NDG6	B3KW70	E9PAV3	E9PGF5	E9PHK0	F5GWP8	F8W031	G3V3G9	...	Q9Y4Y9	Q9Y5P4	Q9Y5X3	Q9Y5Z4	Q9Y600	Q9Y617	Q9Y646	Q9Y678	Q9Y696	Q9Y6I3
0	BAT_NE	BAT_NE1	22.948430	29.091339	26.578739	23.563604	27.322081	27.821164	24.420405	23.947173	...	22.109509	23.130157	22.569201	28.647187	24.888148	26.012226	22.746304	23.610317	24.676053	23.408469
1	BAT_NE	BAT_NE2	22.954981	28.610591	27.468243	23.811590	26.283462	27.452462	23.071490	23.891354	...	22.109780	23.312664	22.572965	28.026016	24.164153	26.294796	22.171377	25.493292	24.928906	22.570449
2	BAT_NE	BAT_NE3	22.817377	28.272086	26.773793	23.708936	25.299149	27.471115	22.773441	23.618293	...	22.104088	23.378619	22.493896	27.956985	25.221587	26.440457	22.150598	25.144410	26.972790	22.340063
3	BAT_NE	BAT_NE4	22.937884	30.812377	26.089441	23.757767	28.160532	28.601127	23.301111	23.881632	...	22.109072	22.237510	22.563140	27.506049	24.793329	24.906694	22.693322	23.264133	25.210342	22.709767
4	BAT_NE	BAT_NE5	23.047168	28.364277	26.971255	24.055731	25.989885	26.889135	22.317248	24.034814	...	22.113593	23.397543	22.625936	28.196715	25.171234	26.397037	21.778724	24.573057	26.251168	22.132167

5 rows × 1169 columns

[29]:

processed_data.describe()

[29]:

identifier	A6NDG6	B3KW70	E9PAV3	E9PGF5	E9PHK0	F5GWP8	F8W031	G3V3G9	H0YFH3	K7ERI9	...	Q9Y4Y9	Q9Y5P4	Q9Y5X3	Q9Y5Z4	Q9Y600	Q9Y617	Q9Y646	Q9Y678	Q9Y696	Q9Y6I3
count	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	...	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000
mean	23.128097	28.716408	26.027184	23.259318	26.659409	28.309229	22.856406	22.577696	22.564280	23.144775	...	22.841194	22.060772	22.595484	28.287357	24.694083	26.657958	22.913560	24.756667	25.660985	22.282274
std	0.491487	0.928132	1.136864	0.969810	0.725931	0.991160	0.747094	1.013024	1.354932	1.273632	...	1.261612	1.114248	0.570500	0.644960	0.580041	0.830303	0.777069	0.794566	0.762354	1.024486
min	21.887487	27.265350	23.243551	20.776162	25.299149	26.639618	21.241523	21.012823	19.727160	21.080893	...	20.098448	20.103549	21.390805	26.324229	23.570569	24.906694	21.778724	23.264133	24.221919	20.460160
25%	22.907757	28.257911	25.233104	22.853012	26.166929	27.678857	22.431878	21.862881	21.612875	22.236824	...	22.107826	21.197256	22.329569	28.087609	24.244477	26.240169	22.317168	24.292354	25.157052	21.392938
50%	23.286221	28.404056	25.972840	23.568937	26.705618	28.368931	22.750731	22.405116	22.260924	23.342752	...	22.955144	22.188302	22.566171	28.330765	24.752862	26.644405	22.889030	24.837943	25.533329	22.455256
75%	23.479114	29.141976	26.977802	23.819750	27.171606	28.939251	23.155947	23.501777	23.464925	23.481043	...	23.662340	23.125324	23.103945	28.546374	25.067533	27.341107	23.248793	25.396285	26.217616	23.167254
max	23.867311	30.812377	27.468243	24.479142	28.160532	30.258672	24.420405	24.034814	25.691071	27.025171	...	25.020940	23.445817	23.364528	29.240840	25.761494	27.822988	24.681853	26.156805	26.981961	23.921616

8 rows × 1167 columns

[30]:

aux = processed_data.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_boxplot_grid(aux, identifier='boxplot', args={"Title":"Boxplot", 'x':'sample', 'y':'value', 'color':'group', 'axis':'cols', 'height':200})#
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_boxplot', plot_format='svg', directory=analysis_dir)

IN

[31]:

aux = processed_data.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_histogram(aux, identifier='histogram', args={'x':'value', 'color':'group', 'facet_row':'sample', 'title':'Facet Grid Plot'})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_distplot', plot_format='svg', directory=analysis_dir)

Mapping Identifiers to Names¶

CKG graph database can be used easily to map UniProt identifiers to protein names that are more commonly used and understood. We build unique identifiers by combining the UniProt ids with the names: name_id

[32]:

mapping_dict = mapping.getMappingFromDatabase(processed_data, node='Protein', attribute_from='id', attribute_to='name')

[33]:

processed_data.columns = [str(mapping_dict[c])+"_"+c if c in mapping_dict else c for c in processed_data.columns]

[34]:

processed_data.head()

[34]:

	group	sample	PGP_A6NDG6	MFAP5_B3KW70	NACA_E9PAV3	TNS1_E9PGF5	CLEC3B_E9PHK0	KRT17_F5GWP8	None_F8W031	G3V3G9	...	LSM5_Q9Y4Y9	CERT1_Q9Y5P4	SNX5_Q9Y5X3	HEBP2_Q9Y5Z4	CSAD_Q9Y600	PSAT1_Q9Y617	CPQ_Q9Y646	COPG1_Q9Y678	CLIC4_Q9Y696	EPN1_Q9Y6I3
0	BAT_NE	BAT_NE1	22.948430	29.091339	26.578739	23.563604	27.322081	27.821164	24.420405	23.947173	...	22.109509	23.130157	22.569201	28.647187	24.888148	26.012226	22.746304	23.610317	24.676053	23.408469
1	BAT_NE	BAT_NE2	22.954981	28.610591	27.468243	23.811590	26.283462	27.452462	23.071490	23.891354	...	22.109780	23.312664	22.572965	28.026016	24.164153	26.294796	22.171377	25.493292	24.928906	22.570449
2	BAT_NE	BAT_NE3	22.817377	28.272086	26.773793	23.708936	25.299149	27.471115	22.773441	23.618293	...	22.104088	23.378619	22.493896	27.956985	25.221587	26.440457	22.150598	25.144410	26.972790	22.340063
3	BAT_NE	BAT_NE4	22.937884	30.812377	26.089441	23.757767	28.160532	28.601127	23.301111	23.881632	...	22.109072	22.237510	22.563140	27.506049	24.793329	24.906694	22.693322	23.264133	25.210342	22.709767
4	BAT_NE	BAT_NE5	23.047168	28.364277	26.971255	24.055731	25.989885	26.889135	22.317248	24.034814	...	22.113593	23.397543	22.625936	28.196715	25.171234	26.397037	21.778724	24.573057	26.251168	22.132167

5 rows × 1169 columns

[35]:

processed_data.to_csv('{}/processed_data_matrix.tsv'.format(analysis_dir), sep='\t', index=False)

Principal Component Analysis¶

[36]:

pca_result, args = analytics.run_pca(processed_data, drop_cols=['sample'], group='group')
args.update({"loadings":10, "title":'PCA plot', 'x_title':'PC1', 'y_title':'PC2', 'height':600, 'width':700})

[37]:

plot = viz.get_pca_plot(pca_result, identifier='pca', args=args)
iplot(plot.figure)
viz.save_DASH_plot(plot, name='pca', plot_format='svg', directory=analysis_dir)

Secretome Human Protein Atlas¶

This project requires a filtering based on known/predicted secreted proteins (secretome). Currently, we don’t have this information in CKG’s graph database but we can use functionality in graphdb_builder to download any database (file) from an URL.

[38]:

secretome_url = 'https://www.proteinatlas.org/search/sa_location%3ASecreted+-+unknown+location%2CSecreted+in+brain%2CSecreted+in+female+reproductive+system%2CSecreted+in+male+reproductive+system%2CSecreted+in+other+tissues%2CSecreted+to+blood%2CSecreted+to+digestive+system%2CSecreted+to+extracellular+matrix%2CIntracellular+and+membrane?format=tsv'

[39]:

builder_utils.downloadDB(databaseURL=secretome_url, directory=analysis_dir, file_name='secretome.tsv', avoid_wget=True)

[40]:

secretome = pd.read_csv(os.path.join(analysis_dir, 'secretome.tsv'), sep='\t')

[41]:

secretome.head()

[41]:

	Gene	Gene synonym	Ensembl	Gene description	Uniprot	Chromosome	Position	Protein class	Biological process	Molecular function	...	Single Cell Type RNA - Rod photoreceptor cells [NX]	Single Cell Type RNA - Sertoli cells [NX]	Single Cell Type RNA - Smooth muscle cells [NX]	Single Cell Type RNA - Spermatocytes [NX]	Single Cell Type RNA - Spermatogonia [NX]	Single Cell Type RNA - Suprabasal keratinocytes [NX]	Single Cell Type RNA - Syncytiotrophoblasts [NX]	Single Cell Type RNA - T-cells [NX]	Single Cell Type RNA - Undifferentiated cells [NX]	Single Cell Type RNA - Urothelial cells [NX]
0	A1BG	NaN	ENSG00000121410	Alpha-1-B glycoprotein	P04217	19	58345178-58353499	Plasma proteins, Predicted intracellular prote...	NaN	NaN	...	5.2	25.8	52.4	1.4	5.1	3.1	1.6	34.8	0.1	0.6
1	A2M	CPAMD5, FWP007, S863-7	ENSG00000175899	Alpha-2-macroglobulin	P01023	12	9067664-9116229	Cancer-related genes, Candidate cardiovascular...	NaN	Protease inhibitor, Serine protease inhibitor	...	18.6	329.8	207.9	1.8	1.1	12.3	3.4	16.7	0.0	2.2
2	A2ML1	CPAMD9, FLJ25179, p170	ENSG00000166535	Alpha-2-macroglobulin like 1	A8K2U0	12	8822472-8887001	Predicted intracellular proteins, Predicted se...	NaN	Protease inhibitor, Serine protease inhibitor	...	0.2	0.0	0.3	1.4	0.0	11.0	0.1	0.0	0.0	4.9
3	A4GNT	alpha4GnT	ENSG00000118017	Alpha-1,4-N-acetylglucosaminyltransferase	Q9UNA3	3	138123718-138132387	Predicted intracellular proteins	NaN	Glycosyltransferase, Transferase	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	AADACL2	MGC72001	ENSG00000197953	Arylacetamide deacetylase like 2	Q6P093	3	151733916-151761339	Predicted secreted proteins	NaN	Hydrolase	...	0.0	0.0	0.0	0.0	0.0	26.8	0.1	0.0	0.0	0.0

5 rows × 292 columns

[42]:

secretome.shape

[42]:

(2641, 292)

[43]:

mapping_dict = mapping.getMappingFromDatabase(secretome['Uniprot'].tolist(), node='Protein', attribute_from='id', attribute_to='name')

[44]:

secretome['identifier'] = [str(mapping_dict[p])+"_"+p if p in mapping_dict else p for p in secretome['Uniprot'].tolist()]

[45]:

secretome.head()

[45]:

	Gene	Gene synonym	Ensembl	Gene description	Uniprot	Chromosome	Position	Protein class	Biological process	Molecular function	...	Single Cell Type RNA - Sertoli cells [NX]	Single Cell Type RNA - Smooth muscle cells [NX]	Single Cell Type RNA - Spermatocytes [NX]	Single Cell Type RNA - Spermatogonia [NX]	Single Cell Type RNA - Suprabasal keratinocytes [NX]	Single Cell Type RNA - Syncytiotrophoblasts [NX]	Single Cell Type RNA - T-cells [NX]	Single Cell Type RNA - Undifferentiated cells [NX]	Single Cell Type RNA - Urothelial cells [NX]	identifier
0	A1BG	NaN	ENSG00000121410	Alpha-1-B glycoprotein	P04217	19	58345178-58353499	Plasma proteins, Predicted intracellular prote...	NaN	NaN	...	25.8	52.4	1.4	5.1	3.1	1.6	34.8	0.1	0.6	A1BG_P04217
1	A2M	CPAMD5, FWP007, S863-7	ENSG00000175899	Alpha-2-macroglobulin	P01023	12	9067664-9116229	Cancer-related genes, Candidate cardiovascular...	NaN	Protease inhibitor, Serine protease inhibitor	...	329.8	207.9	1.8	1.1	12.3	3.4	16.7	0.0	2.2	A2M_P01023
2	A2ML1	CPAMD9, FLJ25179, p170	ENSG00000166535	Alpha-2-macroglobulin like 1	A8K2U0	12	8822472-8887001	Predicted intracellular proteins, Predicted se...	NaN	Protease inhibitor, Serine protease inhibitor	...	0.0	0.3	1.4	0.0	11.0	0.1	0.0	0.0	4.9	A2ML1_A8K2U0
3	A4GNT	alpha4GnT	ENSG00000118017	Alpha-1,4-N-acetylglucosaminyltransferase	Q9UNA3	3	138123718-138132387	Predicted intracellular proteins	NaN	Glycosyltransferase, Transferase	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	A4GNT_Q9UNA3
4	AADACL2	MGC72001	ENSG00000197953	Arylacetamide deacetylase like 2	Q6P093	3	151733916-151761339	Predicted secreted proteins	NaN	Hydrolase	...	0.0	0.0	0.0	0.0	26.8	0.1	0.0	0.0	0.0	AADACL2_Q6P093

5 rows × 293 columns

[46]:

secretome_cols = [c for c in processed_data.columns if c in secretome['identifier'].tolist()]

[47]:

sprocessed_data = processed_data[secretome_cols +['sample', 'group']]

[48]:

sprocessed_data.head()

[48]:

	DNASE2_O00115	QSOX1_O00391	ACACB_O00763	ISLR_O14498	HSPB6_O14558	NRP1_O14786	CALU_O43852	CUTA_O60888	SLIT3_O75094	CILP_O75339	...	DPP7_Q9UHL4	CPA4_Q9UI42	PCOLCE2_Q9UKZ9	EPDR1_Q9UM22	ITM2B_Q9Y287	LOXL2_Q9Y4K0	CSAD_Q9Y600	CPQ_Q9Y646	sample	group
0	24.114083	24.138056	30.128417	23.468740	29.996177	22.214779	31.531017	24.544570	22.014731	23.388637	...	20.294032	23.102487	22.018794	23.550629	24.847902	23.296214	24.888148	22.746304	BAT_NE1	BAT_NE
1	24.119415	25.387356	30.391512	23.489224	29.513447	23.831518	30.589780	24.264634	21.450402	26.813356	...	21.067531	21.805934	24.547985	22.870285	24.352738	23.289641	24.164153	22.171377	BAT_NE2	BAT_NE
2	24.161902	23.798872	29.734602	23.429947	29.180208	22.385902	30.160027	24.237612	27.257009	23.044385	...	20.676194	21.570111	23.086508	22.662654	24.072838	22.711249	25.221587	22.150598	BAT_NE3	BAT_NE
3	23.993172	24.679493	28.181251	23.892080	28.725074	23.909215	30.443124	24.276151	21.425337	29.299769	...	21.066661	21.807458	24.590637	23.047072	24.424457	24.206725	24.793329	22.693322	BAT_NE4	BAT_NE
4	24.193250	23.775452	30.909635	22.821237	29.365446	22.607023	30.262149	23.939482	21.384819	26.076288	...	20.465266	22.136305	23.911137	22.389163	24.190087	23.140772	25.171234	21.778724	BAT_NE5	BAT_NE

5 rows × 243 columns

[49]:

sprocessed_data.to_csv('{}/secreted_processed_data.tsv'.format(analysis_dir), sep='\t', index=False)

Differential Regulation¶

[50]:

alpha = 0.05
fc = 2
aov_result = analytics.run_anova(sprocessed_data, alpha=alpha, drop_cols=['sample'], subject=None, group='group', permutations=0, correction='fdr_bh', is_logged=True)

[51]:

aov_result.head()

[51]:

	identifier	group1	group2	mean(group1)	std(group1)	mean(group2)	std(group2)	posthoc Paired	posthoc Parametric	posthoc T-Statistics	...	FC	efftype	F-statistics	pvalue	padj	correction	rejected	-log10 pvalue	Method	posthoc padj
0	A1BG_P04217	BAT_NE	BAT_woNE	22.431113	1.053262	21.159121	0.166747	False	True	2.667213	...	2.414949	hedges	5.335302	0.009697	0.024255	FDR correction BH	True	1.545383	One-way anova	0.177929
1	A1BG_P04217	BAT_NE	WAT_NE	22.431113	1.053262	21.424311	0.425668	False	True	1.981715	...	2.009452	hedges	5.335302	0.009697	0.024255	FDR correction BH	True	1.081820	One-way anova	0.207934
2	A1BG_P04217	BAT_NE	WAT_woNE	22.431113	1.053262	22.382288	0.529607	False	True	0.092607	...	1.034422	hedges	5.335302	0.009697	0.024255	FDR correction BH	True	0.032221	One-way anova	0.960372
3	A1BG_P04217	BAT_woNE	WAT_NE	21.159121	0.166747	21.424311	0.425668	False	True	-1.297094	...	0.832089	hedges	5.335302	0.009697	0.024255	FDR correction BH	True	0.636851	One-way anova	0.394409
4	A1BG_P04217	BAT_woNE	WAT_woNE	21.159121	0.166747	22.382288	0.529607	False	True	-4.925975	...	0.428341	hedges	5.335302	0.009697	0.024255	FDR correction BH	True	2.937312	One-way anova	0.009281

5 rows × 26 columns

[52]:

aov_result.loc[aov_result.rejected].identifier.unique().shape

[52]:

(122,)

[53]:

aov_result.to_csv('{}/anova_results.tsv'.format(analysis_dir), sep='\t', index=False)

[54]:

args={'alpha':alpha,
      'fc':fc,
      'colorscale':'Blues',
      'showscale': False,
      'marker_size':10,
      'num_annotations':480,
      'x_title':'log2FC',
      'y_title':'-log10(pvalue)'}
figures = viz.run_volcano(aov_result, identifier='volcano', args=args)
i = 1
for figure in figures:
    if figure.figure['layout'].title['text'] == 'Comparison: BAT_woNE vs WAT_woNE':
        iplot(figure.figure)
        i += 1
        viz.save_DASH_plot(figure, name='BAT_woNE_vs_WAT_woNE_volcano_plot', plot_format='svg', directory=analysis_dir)

Analysis of Upregulated Proteins in Unestimulated BAT va WAT¶

[55]:

upregulated_proteins = aov_result[(aov_result['group1'] == 'BAT_woNE') &
                                  (aov_result['group2'] == 'WAT_woNE') &
                                  (aov_result['posthoc padj'] <= alpha) &
                                  (aov_result['log2FC'] >=1)]

[56]:

selected_df = df.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], [p.split('_')[1] if len(p.split('_')) > 1 else p for p in upregulated_proteins.identifier.unique().tolist()]].reset_index()
selected_df.groupby("group").mean()

[56]:

identifier	P02749	P61769	Q07021	P13671	P12532	P10909	P53634	Q9UBR2	O00115	P26885	...	P13667	Q13162	P02753	P34096	Q99584	O95969	P31947	P10124	Q99727	Q13263
group
BAT_woNE	25.182105	30.143430	26.047659	24.232947	28.330328	27.380916	24.416099	25.82257	23.386786	23.663076	...	25.794822	23.634525	29.731995	26.620535	26.522128	24.171815	27.728345	24.140254	24.868147	24.125775
WAT_woNE	24.559642	28.309072	24.776456	NaN	NaN	25.401467	NaN	NaN	NaN	NaN	...	25.519324	NaN	28.364350	26.031432	25.062536	NaN	NaN	NaN	NaN	NaN

2 rows × 30 columns

[57]:

plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Upregulated Proteins (original data)", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)

[58]:

selected_df = processed_data.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], upregulated_proteins.identifier.unique().tolist()].reset_index()
selected_df.groupby("group").mean()

[58]:

	APOH_P02749	B2M_P61769	C1QBP_Q07021	C6_P13671	CKMT1A_P12532	CLU_P10909	CTSC_P53634	CTSZ_Q9UBR2	DNASE2_O00115	FKBP2_P26885	...	PDIA4_P13667	PRDX4_Q13162	RBP4_P02753	RNASE4_P34096	S100A13_Q99584	SCGB1D2_O95969	SFN_P31947	SRGN_P10124	TIMP4_Q99727	TRIM28_Q13263
group
BAT_woNE	25.182105	30.143430	26.047659	24.222532	28.330328	27.380916	24.421102	25.822570	23.376689	23.665083	...	25.794822	23.633922	29.731995	26.620535	26.522128	24.171815	27.728345	24.16699	24.899663	24.149016
WAT_woNE	23.212651	28.309072	22.956111	22.213605	22.204772	25.334937	23.257274	22.665958	22.361763	22.391318	...	23.120770	22.283522	28.364350	23.248443	25.036334	22.413116	22.685906	21.99328	22.562938	22.603027

2 rows × 30 columns

[59]:

plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Upregulated Proteins", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)

Analysis Downregulated Proteins in Unestimulated BAT va WAT¶

[60]:

downregulated_proteins = aov_result[(aov_result['group1'] == 'BAT_woNE') &
                                  (aov_result['group2'] == 'WAT_woNE') &
                                  (aov_result['posthoc padj'] <= alpha) &
                                  (aov_result['log2FC'] <= -1)]

[61]:

selected_df = df.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], [p.split('_')[1] if len(p.split('_')) > 1 else p for p in downregulated_proteins.identifier.unique().tolist()]].reset_index()
selected_df.groupby("group").mean()

[61]:

identifier	P04217	P39059	P39060	P12111	Q9Y646	P10619	P07711	Q07507	O94769	Q92520	...	Q8NHP8	P42785	Q9BRX8	P50454	Q9HAT2	P78539	P10646	P02786	Q15582	P24821
group
BAT_woNE	NaN	23.520278	NaN	30.474014	22.265727	23.106582	24.691102	24.647780	NaN	NaN	...	22.702228	24.413882	21.646408	26.485985	23.519893	25.031933	22.578169	21.273611	NaN	22.252515
WAT_woNE	NaN	24.732646	23.784856	31.820459	23.504914	24.305533	25.785085	25.878753	NaN	NaN	...	24.245612	25.850791	23.023065	27.525384	23.439551	26.176678	NaN	NaN	24.882087	24.936074

2 rows × 30 columns

[62]:

plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Downregulated Proteins (original data)", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)

[63]:

selected_df = processed_data.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], downregulated_proteins.identifier.unique().tolist()]
selected_df.groupby("group").mean()

[63]:

	A1BG_P04217	COL15A1_P39059	COL18A1_P39060	COL6A3_P12111	CPQ_Q9Y646	CTSA_P10619	CTSL_P07711	DPT_Q07507	ECM2_O94769	FAM3C_Q92520	...	PLBD2_Q8NHP8	PRCP_P42785	PRXL2A_Q9BRX8	SERPINH1_P50454	SIAE_Q9HAT2	SRPX_P78539	TFPI_P10646	TFRC_P02786	TGFBI_Q15582	TNC_P24821
group
BAT_woNE	21.159121	23.515881	20.742757	30.474014	22.284273	23.135425	24.691102	24.652254	21.251165	20.547066	...	21.122663	24.413882	21.406060	26.462812	21.796696	25.050929	21.369273	21.272058	21.429750	21.270361
WAT_woNE	22.382288	24.767523	23.791789	31.820459	23.447894	24.320570	25.777819	25.803667	23.080337	22.831844	...	23.125767	25.880017	22.730218	27.525384	23.460160	26.176678	23.005681	22.629342	24.927593	25.012742

2 rows × 30 columns

[64]:

plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Downregulated Proteins", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)

Functional Enrichment – GO Biological Processes¶

[65]:

annotation_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(bp:Biological_process) WHERE (p.name +"_"+p.id) IN [{}] RETURN DISTINCT (p.name +"_"+p.id) AS identifier, bp.name AS annotation'
prots = ["'{}'".format(p) for p in processed_data.drop(['sample', 'group'], axis=1).columns.unique().tolist()]

[66]:

driver = connector.getGraphDatabaseConnectionConfiguration()
annotation = connector.getCursorData(driver, annotation_query.format(",".join(prots)))

[67]:

annotation.head()

[67]:

	annotation	identifier
0	mitochondrial genome maintenance	TYMP_P19971
1	regulation of DNA recombination	ALYREF_Q86V81
2	maltose metabolic process	GAA_P10253
3	ribosomal large subunit assembly	RPL11_P62913
4	ribosomal large subunit assembly	RPL5_P46777

[68]:

enrichment = analytics.run_up_down_regulation_enrichment(aov_result, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)

C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

[69]:

figures = viz.get_enrichment_plots(enrichment, identifier='enrichment', args={'width':2200})
i = 0
for fig in figures:
    iplot(fig.figure)
    viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory='.')
    i += 1

Generating Knowledge Graph¶

All the connections stored in CKG’s knowledge graph can be used to annotate the list of differentially regulated proteins. These annotations based on associations to Biological processes, diseases, tissues, protein complexes or drugs can be used to build a knowledge subgraph specific for this study.

[70]:

project_knowledge = knowledge.Knowledge(identifier='Batokines',
                              data=None,
                              nodes={},
                              relationships={},
                              queries_file=None,
                              colors={},
                              graph=None,
                              report={})

[71]:

regulation = upregulated_proteins[['identifier', 'log2FC']]
regulation['reg'] = 'Upregulated'

[72]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=regulation,
                                                   entity1='upregulated',
                                                   entity2='Protein',
                                                   source='reg',
                                                   target='identifier',
                                                   rtype='is_upregulated',
                                                   weight='log2FC')

[73]:

regulation = downregulated_proteins[['identifier', 'log2FC']]
regulation['log2FC'] = np.abs(regulation['log2FC'])
regulation['reg'] = 'Downregulated'

[74]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=regulation,
                                                   entity1='downregulated',
                                                   entity2='Protein',
                                                   source='reg',
                                                   target='identifier',
                                                   rtype='is_downregulated',
                                                   weight='log2FC')

[75]:

nodes, rels = project_knowledge.generate_knowledge_from_enrichment(data=enrichment, name="Biologicalprocesses")
project_knowledge.update_nodes(nodes)
project_knowledge.update_relationships(rels)

Disease Associations¶

[76]:

disease_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(d:Disease)-[:HAS_PARENT*2..4]->(pd:Disease) WHERE toFloat(r.score)>1.5 AND (p.name+"_"+p.id) IN [{}] AND pd.name="disease of metabolism" RETURN (p.name+"_"+p.id) AS protein, d.name AS disease, d.id AS disease_id, r.score AS score'
regulated_prots = ["'{}'".format(p) for p in upregulated_proteins.identifier.unique().tolist() + downregulated_proteins.identifier.unique().tolist()]
associations = connector.getCursorData(driver, disease_query.format(",".join(regulated_prots)))
diseases = associations['disease_id'].unique().tolist()

[77]:

associations.head()

[77]:

	disease	disease_id	protein	score
0	amyloidosis	DOID:9120	ITM2B_Q9Y287	4.543
1	amyloidosis	DOID:9120	IDE_P14735	2.600
2	amyloidosis	DOID:9120	GRN_P28799	1.597
3	amyloidosis	DOID:9120	B2M_P61769	4.647
4	amyloidosis	DOID:9120	MFGE8_Q08431	4.314

[78]:

net = viz.get_network(associations,
                      identifier='disease_network',
                      args={'source': 'protein',
                            'target': 'disease',
                            'values': 'score',
                            'node_size': 'degree',
                            'title': 'Disease Network',
                            'color_weight': False})
viz.visualize_notebook_network(net['notebook'])

[79]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Disease',
                                                   source='protein',
                                                   target='disease',
                                                   rtype='associated_with',
                                                   weight='score')

Tissue Associations¶

Brown Adipose Tissue¶

[80]:

tissue_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(t:Tissue) WHERE t.id="BTO:0000156" AND (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, t.name AS tissue, r.score AS score'

associations = connector.getCursorData(driver, tissue_query.format(",".join(regulated_prots)))

associations

[81]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Tissue',
                                                   source='protein',
                                                   target='tissue',
                                                   rtype='associated_with',
                                                   weight='score')

White Adipose Tissue¶

[82]:

tissue_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(t:Tissue) WHERE t.id="BTO:0001456" AND (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, t.name AS tissue, r.score AS score'

associations = connector.getCursorData(driver, tissue_query.format(",".join(regulated_prots)))

[83]:

associations.head()

[83]:

	protein	score	tissue
0	COL15A1_P39059	0.725	white adipose tissue
1	B2M_P61769	1.609	white adipose tissue
2	SERPINH1_P50454	1.049	white adipose tissue
3	COL6A3_P12111	1.650	white adipose tissue
4	CTSL_P07711	1.310	white adipose tissue

[84]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Tissue',
                                                   source='protein',
                                                   target='tissue',
                                                   rtype='associated_with',
                                                   weight='score')

Protein Complexes¶

[85]:

complexes_query = 'MATCH (p:Protein)-[r:IS_SUBUNIT_OF]->(c:Complex) WHERE (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, c.name AS complex, r.score AS score'

associations = connector.getCursorData(driver, complexes_query.format(",".join(regulated_prots)))
associations['score'] = 0.0

[86]:

associations.head()

[86]:

	complex	protein
0	NCOR1 complex	TRIM28_Q13263
1	HDAC10-PAX3-KAP1 complex	TRIM28_Q13263
2	ITGA9-ITGB1-TNC complex	TNC_P24821
3	Emerin complex 24	C1QBP_Q07021
4	TF-TFRC complex	TFRC_P02786

[87]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Complex',
                                                   source='protein',
                                                   target='complex',
                                                   rtype='is_subunit_of',
                                                   weight='score')

Drug Associations¶

[88]:

drugs_query = 'MATCH (d:Drug)-[r:ACTS_ON]->(p:Protein) WHERE toFloat(r.score)>0.7 AND (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, d.name AS drug, r.score AS score'
associations = connector.getCursorData(driver, drugs_query.format(",".join(regulated_prots)))

[89]:

associations.head()

[89]:

	drug	protein	score
0	Formic acid	TFRC_P02786	0.800
1	Magnesium cation	B2M_P61769	0.958
2	Magnesium cation	HEXB_P07686	0.836
3	Magnesium cation	HEXA_P06865	0.836
4	Magnesium cation	KARS1_Q15046	0.800

[90]:

project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Drug',
                                                   source='protein',
                                                   target='drug',
                                                   rtype='acts_on',
                                                   weight='score')

Visualize the Knowledge Graph¶

[91]:

project_knowledge.generate_report(visualizations=['sankey'], summarize=True)
project_knowledge.report.visualize_report(environment='notebook')

[91]:

[]

[92]:

project_knowledge.generate_report(visualizations=['network'], summarize=True)
project_knowledge.report.visualize_report(environment='notebook')[0]